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REMARKS 

Claims 1 and 4-29 are pending. Claim 1 has been amended to remove means- 
plus-function language. Claims 15 and 26 have been amended to correct minor 
typographical errors. Claims 1 and 4-29 have been amended to replace the word 
"Handsfree" with the word "A." No claims have been canceled, and no new claims have 
been added. Applicant respectfully requests reconsideration of the claims in view of the 
above amendments and the following remarks. 

Claims 1 and 4-29 stand rejected under 35 U.S.C. § 112 as based on a disclosure 
which is not enabling. This rejection is based on the fact that the language of claim 1 
requires fixed filters, but also requires use of a frequency-dependent regularization 
parameter. As an initial matter. Applicant notes that claims 23-29 do not include the 
claim limitation in question. As to the rejection, a beamformer having "fixed 
superdirective filters" means that during operation, the filters do not change over time, 
i.e. the beamformer is non-adaptive. See specification page 1 (explaining the adaptive 
nature of prior art beamformers against which a non-adaptive beamformer as claimed is 
contrasted). The regularization parameter is used during design of filter coefficients for 
the beamformer, and according to claim 1 such coefficients may be determined with 
different values for the regularization parameter corresponding to different frequencies. 
This is not to say, however, that the regularization parameter varies with respect to time, 
however. Indeed, it is well-known in the art for a filter that is fixed with respect to time 
to have response characteristics that vary according to frequency, such as, for example, a 
"low-pass filter." The design of fixed filter coefficients and the use of a frequency- 
dependent regularization parameter also are explained below in greater detail in relation 
to the prior art rejections. Applicant believes the specification clearly describes and 
enables the claims. 

Independent claim 1 stands rejected under 35 U.S.C. § 102(e) as being anticipated 
by Kajala. Kajala lacks both the regularization parameter as required by claim 1, and 
Kajala employs an adaptive filter, whereas claim 1 requires a fixed filter. 
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Claim 1 requires that the superdirective beamformer is a regularized 
superdirective beamformer using a finite regularization parameter |u, that is frequency 
dependent. The Office Action cites Kajala's parameter "D" as being a frequency 
dependent regularization parameter. Applicant does not find use of a regularization 
parameter in Kajala. On the contrary, in Kajala, D represents steering parameters (col. 10 
lines 36-46). Applicant notes that also according to the instant specification, "d" refers to 
a steering vector (see, e.g., page 6). For the Examiner's convenience and for clarification 
purposes as to the meaning of the claim term "regularization parameter" as it would be 
imderstood by those of ordinary skill in the art. Applicant encloses along with this 
response a copy of "Solving Ill-Conditioned and Singular Linear Systems: A Tutorial on 
Regularization," by Arnold Neumaier, at page 6, a relevant excerpt of which is shown 
below: 

4. Tite &mm mtiiamt^.. The key u> the !.r«itrmitit oi' j^i->^<:«::ssf fe{:«f ;<j;isje?fja:. 
&■ ten w«> SihsUl ma all Iktear systfetiis <vl:i(?!fs5 tlie st^isdasd tsisclmiqixes art: isia<Je>- 
(jidatft, is & pmcess called ft'^kmMs&ri s bsii sepi&tx's .4"""^ by & fesasly C'^. (h > (S) ijf 

tkm pimmui&}\ {Mme- g^^wi-i&liy^ iuiy p&t swinste smS-issksg th« tte&i.«.ti(.'i.« oi may 
nfcred to Sis a regiskrisssfciats. psr&meter, ) 

In Applicant's specification at page 7, it is disclosed that: 
The optirnai filler coefficierits A^im) can be oomputeid accordJrtg to 



w:!^ 



wh&min the supersofifpt H cfenotss Hermitfen transposing and r((ii) is itie complex 

It is further disclosed that the equation above is based on assumptions of ideal 
microphones, and that the filter design can be refined flirther through regularization, as 
shown in the following modified equation: 

Thus n is used as a regularization parameter in accordance with the understanding shown 
in Neumaier, excerpted above. 
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Furthermore, claim 1 requires that the regularization parameter be frequency 
dependent. Accordingly, in designing filter coefficients Ai(w) according to the disclosed 
principles, Ai(w) will be computed using a different value for the regularization 
parameter at each value of w. Once the filter coefficients have been determined, 
however, they are not adapted during operation, but remain fixed , as required by claim 1 . 

The beamformer of Kajala, however, does not include fixed filters, as required by 
claim 1 . As Kajala recites, "[i]t is the main advantage of the system that the filter 
coefficients of the beamformer are not fixed but adjustable. One section of Kajala also 
cited in the rejection (col. 6 lines 42-60) in fact describes the general way that such 
adjustable filter coefficients are found (see fiirther col. 6, Unes 35-37: "the adjustable 
filter coefficients ... are approximated by the following. . ."). While Kajala does disclose 
that some of the parameters used in determining the filter coefficients are fixed (col. 6 
line 41), other parameters are variable (col. 6 line 42), meaning that the coefficients are 
not fixed. Because Kajala describes the fact that the filter coefficients are not fixed as the 
"main advantage," rather than anticipating, Kajala directly teaches away from the subject 
matter of claim 1, according to which filter coefficients are required to be fixed. 

Kajala thus fails to teach the limitations of claim 1, and claim 1 is allowable over 
Kajala, because (1) Kajala does not disclose or suggest a frequency dependent 
regularization parameter, and (2) Kajala is directed to an adaptive filter whereas claim 1 
requires a fixed filter. Claims 4-22 depend from claim 1 and are allowable for at least the 
same reasons as claim 1. 

Independent claim 23 stands rejected under 35 U.S.C. § 102(e) as being 
anticipated by Kajala. Similarly to claim 1, claim 23 requires "a superdirective 
beamformer having fixed superdirective filters," and the rejection fails to address this 
limitation, as explained in the discussion of claim 1 above. 

Moreover, claim 23 also requires that "the superdirective beamformers are 
configured with a predetermined susceptibility that is based on a relative error of the 
microphone array." As explained in the specification at page 10, susceptibility is the 
inverse of white noise gain, and the white noise gain describes the ability of the array to 
suppress uncorrelated noise. Claim 23 thus requires that such a susceptibility is 
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predetermined, and that the predetermined susceptibility is based on a relative error of the 
microphone array. The rejection, by contrast, cites Kajala, col. 11, lines 18-42, which 
recites the use of a Chebyshev approximation. Susceptibility and white noise gain are not 
discussed. As Kajala recites, a Chebyshev approximation is a mathematical function that 
is used to approximate the value of a different, continuous function, thus minimizing the 
maximum "error," wherein "error" is the difference between the actual value of the 
function to be approximated and the value generated using the Chebyshev approximation. 
Applicant's claims, on the other hand, require using a relative error of a microphone array 
(i.e., a difference between output characteristics of a theoretically perfect microphone and 
an actual, imperfect microphone) in configuring the superdirective beamformers to have a 
predetermined susceptibility. 

Kajala thus fails to teach the limitations of claim 23, and claim 23 is allowable 
over Kajala. Claims 24-29 depend from claim 23 and are allowable for at least the same 
reasons as claim 23. 

All pending claims are believed to be in a form suitable for allowance. Therefore, 
the application is believed to be in a condition for allowance. Applicant respectfiiUy 
requests early allowance of the application. Applicant requests that the Examiner contact 
the undersigned, Geoffrey L. R. Williamson, if it will assist fiirther examination of this 
application. 

Applicant does not believe any extension of time is required for timely 
consideration of this response. In the event that an extension has been overlooked, this 
conditional petition of extension is hereby submitted. Applicant requests that deposit 
account number 19-4972 be charged for any fees that may be required for the timely 
consideration of this application. 
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SOLVING ILL-CONDITIONED AND SINGULAR LINEAR SYSTEMS: 
A TUTORIAL ON REGULARIZATION 



ARNOLD NEUMAIER * 

Abstract. It is shown that the basic regularization procedures for finding meaningful approxi- 
mate solutions of ill-conditioned or singular linear systems can be phrased and analyzed in terms of 
classical linear algebra that can be taught in any numerical analysis course. Apart from rewriting 
many known results in a more elegant form, we also derive a new two-parameter family of merit 
functions for the determination of the regularization parameter. The traditional merit functions 
from generalized cross validation (GCV) and generalized maximum likelihood (GML) are recovered 
as special cases. 

Key words, regularization, ill-posed, ill-conditioned, generalized cross validation, generalized 
maximum likelihood, Tikhonov regularization, error bounds 

AMS subject classifications, primary 65F05; secondary 65J20 

1. Introduction. In many applications of linear algebra, the need arises to find 
a good approximation f to a vector x e H" satisfying an approximate equation 
Ax « y with ill-conditioned or singular .4 G R™^", given y e IR™. 

Usually, y is the result of measurements contaminated by small errors (noise). A 
may be, e.g., a discrete approximation to a compact integral operator with unbounded 
inverse, the operator relating tomography measurements y to the underlying image 
or a matrix of basis function values at given points relating a vector y of approximate 
function values to the coefficients of an unknown linear combination of basis functions. 
Fr(XiiHnitly, ill-conditioned or singular systems also arise in the iterative solution of 
iionlinoar systems or optimization problems. 

The importance of the problem can be seen from a glance at the following, prob- 
ably incomplete list of applications: numerical differentiation of noisy data, nonpara- 
metric smoothing of curves and surfaces defined by scattered data, image reconstruc- 
tion, fl(H-()iivoliiti()ii of scciueticcs and images (Wioiior filtering), shapo from shading, 
computer-assisted tomography (CAT, PET), indirect measurements and nondestruc- 
tive testing, multivariate approximation by radial basis functions, training of neural 
u("tw()rks, inverse scattering, seismic analysis, parainotor identification in dynamical 
systems, analytic continuation, inverse Laplace transforms, calculation of relaxation 
spectra, air pollution source detection, solution of partial differential equations with 
nonstandard data (backward heat equation, Cauchy problem for parabolic equations, 
equations of mixed type, multiphase flow of chemicals, etc.), .... The surveys by 
Engl [8] and the book by Groetsch [12] contain many pertinent references. 

In all such situations, the vector x = A~^y (or in the full rank overdetermined 
case A'^y, with the pseudo inverse A'^ = {A*A)~^A*), if it exists at all, is usually a 
meaningless bad approximation to :x. (This can be seen from an analysis in terms of 
the singular value decomposition; see Section 6.) Moreover, even when some vector 
f is a reasonable approximation to a vector x with Ax = y, the usual error estimates 
\\x — x\\ < 11^^^ II ll^x — y\\ in the square case or \\x — x\\ < \\A'^\\ \\Ax — y\\ in the 
overdetermined case axe ridiculously pessimistic. 

So-called regularization techniques are needed to obtain meaningful solution esti- 
mates for such ill-posed problems, where some parameters are ill-determined by least 
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squares methods, and in particular when the number of parameters is larger than the 
number of available measurements, so that standard least squares techniques break 

down. 

A typical situation (but by no means the only one) is that most parameters 
in the state vector are function values of a function at many points of a suitable 
grid (or coefficients in another discretization of a function). Refining a coarse grid 
increases the number of parameters, and when a finite amount of (grid-independent) 
data are given, one cannot use a very fine grid with standard least squares, which 
requires m > n. Of course, one expects the additional parameters introduced by 
a refinement of the grid to be closely related to the parameters on the coarse grid, 
since the underlying function is expected to have some regularity properties such as 
continuity, differentiability, etc.. To get sensible parameter estimates in such a case it 
is necessary to be able to use this additional qualitative information. (The underlying 
continuous problem is often an integral equation of the first kind; see, e.g., Wing & 
Zahrt [37] for an easy-to-read introduction.) 

Though frequently needed in applications, the adequate handling of such ill-posed 
linear problems is hardly ever touched upon in numerical analysis text books. Only 
in GoLUB & VAN Loan [11], the topic is briefly discussed under the heading ridge 
regression, the statisticians' name for Tiklionov regularization: and a book by BJORCK 
[4] on least squares problems has a section on regularization. 

The main reason for this lack of covering seems to be that the discussion of 
regularization techniques in the literature (TiKHOXOV [31], Engl et al. [9], Hanke 
[14], Wahba [34]) is usually phrased in terms of functional analytic language, geared 
towards infinite-dimensional problems. (A notable exception is the work by Hansen 
(see Hansen [17], Hanke fc Hansen [15], and tho rocont hook Hansen [18]) that is 
closer to the spirit of the present paper.) This tends to make the treatments unduly 
application specific and clutters the simplicity of the arguments with irrelevant details 
and distracting notation. We summarize the functional analytic approach in Section 2, 
mainly to give those familiar with the tradition a guide for recognizing what happens 
in the rest of the paper. 

However, those unfamiliar with functional analysis may simply skip Section 2, 
since the main purpose of the present paper is to show that regularization can be 
discussed using elementary hut elegant linear algebra, accessible to numerical analysis 
students at any level. The hnear algebra setting is also much closer to the way in 
which regularization methods are implemented in practice. 

Most results derived in this paper (the major exception is the theory in Section 
10 leading to a new family of merit functions) are known in a more or less similar 
form for problems in function spaces. What is new, however, is the derivation from 
assumptions that make sense in a finite-dimensional setting, and with proofs based 
on standard linear algebra. 

Section 3 motivates an abstract and general approach to smoothness conditions of 
the form x = Sw with a vector w of reasonable norm and a suitable smoothness matrix 
S derivable from the coefficient matrix and some assumed order p of differentiability. 
Section 4 derives and discusses the basic Theorem 4.1, giving deterministic error 
bounds that show that regularization is possible. It is also shown that general ill- 
posed problems behave in a way completely analogous to perhaps the simplest ill- 
posed problem, numerical differentiation, for which the cure has been understood for 
a very long time. 

Section 5 specializes Theorem 4.1 to obtain some specific regularization tech- 
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niques. In particular, good approximate inverses for regularization can be derived by 
modifying the standard least squares formula. The traditional Tikhonov regularizar 

tion by means of 

X = {A* A + H^I)-^A*y 

and an iterated version of it are covered by the basic theorem. Section 6 then invokes 
the singular value decomposition to compute more flexible approximate inverses, in- 
cluding a familiar one based on the truncated singular value decomposition. 

In Section 7 we discuss an inherent limitation of regularization techniques based 
on deterministic models - there cannot be reliable ways to determine regularization 
parameters (such as and p) in the absence of information about the size of the 
residuals and the degree of smoothness. However, the latter situation is very frequent 
in practice, and finding valid choices for the regularization parameter is still one of 
the current frontiers of research. 

The remainder of the paper therefore discusses a stochastic framework in which 
it is possible to rigorously study techniques for the selection of the optimal regular- 
ization parameter in the absence of prior information about the error level. We first 
prove (in Sections 8 and 9) stochastic results that parallel those obtained for the de- 
terministic case, with some significant differences. In particular. Section 8 shows that 
the expected squared norm of the residual can be minimized explicitly (a result due to 
Bertero et al. [3]), and leads in the simplest case again to Tikhonov regularization. 

Section 9 expresses the optimal estimator in a more general situation in terms 
of the singular value decomposition and discusses the attainable limit accuracy. The 
resulting formulas are similar to those arising in deconvolution of sequences and images 
by Wiener filters (Wiener [36]), perhaps the earliest practical use of regularization 
techniques. A modern treatment from a practical point of view can be found, e.g., in 
Katsaggelos [19]. 

In the stochastic approach, the regularization parameter turns out to reappear 
as a variance quotient, and this permits its estimation througli variance component 
estimation techniques. In Section 10, which is less elementary than the rest of the 
paper, we derive a family of merit functions whose minimizers give approximations 
to the ideal regularization parameter; the merit functions contain the generalized 
cross validation approach and the generalized maximum likelihood approach as special 
cases. 

Finally, Section 11 extends the stochastic approach to the situation where the 

smoothness condition x = Sw is replaced by the condition that some vector Jx, usu- 
ally composed of suitably weighted finite differences of function values, is reasonably 
bounded. This is again a smoothness condition, but by judicious choice of J, the 
smoothness requirements can be better adapted to particular problems. 

2. Regularization in function spaces. In this section (only), we assume the 
reader to be familiar with concepts from functional analysis; however, for those un- 
familiar with these concepts, there is no harm in skipping the section, since the re- 
mainder of the paper is completely independent of it. 

We shall only give a short sketch of the traditional theory; for more detailed 
references, in depth treatments and applications we refer to the siu'veys mentioned in 
the introduction. In particular, we shall use in this section the notation of the book 
by Engl, Hanke & Neubauer [9]. 

The objective (e.g., in computing the inverse Radon transform in tomography) is 
to solve a linear operator equation of the form Tx = y, where T : X ^ y isa, compact 
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linear operator between two Hilbert spaces X and y\ y € y is given, and x £ X is 
wanted. Typically, X and y are closed subsets of Sobolev spaces of functions, and T is 
an integral operator. In numerical applications, everything needs to be discretized and 
becomes finite-dimensional. Thus x and y are replaced by discrete versions consisting 
of function values or coefficients of expansions into basis functions, and T becomes a 
matrix A with some structure inherited from the continuous problem. 

The choice of Hilbert spaces amounts to fixing norms in which to measure x and 
y, and has in the discretized version an analogue in the choice of suitable scales for 
the norms to be used to assess accuracies. (In R", all norms are equivalent, but for 
numerical purposes, correct scaling may make a crucial difference in the magnitude 
of the norms.) 

In the function space setting, useful error estimates can be obtained only if x 
satisfies some smoothness restrictions. These restrictions taJie two main forms. 

In the first form (due to Tikhonov [31]. cf. [9], Chapter 3), one assumes that x 
is in the range TZ{B) of a smoothing operator B, a product of p alternating factors 
T and its adjoint T* . For problems involving univariate functions, p can be roughly 
interpreted as the number of weak derivatives bounded in the £^ norm. To obtain 
error bounds, one has to assume the knowledge of a bound w on the norm of some w 
with X = Bw. The number uj can be interpreted as a rough measure of the magnitude 
of the pth derivative of x. 

The choice of the smoothness operator B (in the above setting uniquely deter- 
mined by p) is an a priori modeling decision, and amounts to selecting the smoothness 
of the reconstructed solution. Note that even for analytical problems, it is not always 
appropriate to choose p large, since w often grows exponentially with p; consider, e.g., 
x{t) = sm{uit) for large co. Another a priori modeling decision is the selection of an 
appropriate level 6 for the accuracy of the approximation y to Tx. 

The results obtained about particular regularization methods are all of the form 
that if p is not too large and UTa; — '(/|| < S then the regularized solution xs has an error 
lla;^ -a;|| = C>(<5f/(f+i)c<;i/(P+i)). Tlio reduced exponent p/ip+ 1) < 1 reflects the fact 
that the inverse of a compact operator is unbounded, resulting in the ill-posedness of 
the problem. Standard results (cf. Proposition 3.15 of [9]) also imply that this is the 
best exponent of S achievable. 

The construction of an approximation satisfying this error boimd depends on the 
knowledge of p and 5 or another constant involving 6 such as d/oj; by a theorem 
of Bakushinskii [1] (reproved in [9] as Theorem 3.3), any technique for choosing 
r(^giil;ui/ation paraui(^t<TS in tli<^ absi^ncc of information about the error level can be 
defeated by suitably constructed counterexamples whenever the pseudo inverse of T 
is unbounded. 

However, in practice, there is often not enough information available that provide 
adequate values for 6 and p, which limits the deterministic approach. Some results in 
a stochastic functional analytic setting are given in the context of smoothing splines 
by Wahba [34] (Chapter 4.5 and references there). 

The second form of the smoothness restrictions (due to Phillips, cf. [28], Nat- 
TERER [25] or [9], Chapter 8) is given by the assumption that, for some differential 
operator L and some integer k > 0, the norm of L^'x is finite. The theory gives 
results and limitations similar to that for the other smoothness restriction. 

In the discretized version, the second form of the smoothness restrictions is usually 
modeled by imposing the condition that the norm of some vector Jx composed of 
suitable finite difl:erences of the solution vector x is assumed to have moderate size 
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only; cf. Section 11. 

3. Modeling smoothness. At least in the rank-deficient case where the rank 
of A is smaller than the dimension of x, it is clear that some additional information is 
needed to find a satisfactory solution of Ax = y, since infinitely many solutions exist 
if there is one at all. From a numerical point of view, ill-conditioned systems behave 
just like singular ones, and additional information is needed, too. In the applications, 
the correct one among all (ne«ir) solutions is characterized by additional properties, 
most commonly by requiring the "smoothness" of some function, curve or surface 
constructed from x. 

This qualitative knowledge can be modeled as follows. Given a vector w repre- 
senting a continuous function /, for example as a list of function values w-i = f(ti) 
at the points U of some grid, we may apply some form of numerical integration to w 
to arrive at an (approximate) representation of a differentiable function with 
derivative /; and repeating this p times, we get a representation of a p times dif- 
ferentiable function. Algebraically, we have = Sw for a suitable coefiicient matrix 
5, the smoothing matrix. Therefore, we can formulate smoothness of x by requiring 
that X can be represented in the form .r = .S'm' with some vector w of reasonable norm. 
The smoothing matrix S characterizes the additional information assumed about x. 
(More flexible alternative smoothness conditions are discussed in Section 11.) 

In practice, S might be a discretized integral operator augmented by boundary 
conditions. The most convenient way, however, constructs S directly from A and thus 
works without an\- detailed knowledge about smoothness requirements. 

To motivate the recipe we rewrite the numerical differentiation of a p -I- 1 times 
continuously differentiable function j/ : [0, 1] R as an (infinite-dimensional) regu- 
larizatioti pT'obletti. The situation is so simple that no functional analysis is needed, 
and we simplify further by also assuming that y and its first p+l derivatives vanish 
at t = 0 and t = \. Finding the derivative x{t) = y'{t) = S7y{t) can be posed as the 
problem of solving the linear system Ax = y where A is the integral operator defined 



The differentiability assumption says that w = V = Var is continuous and 
bounded, and since A = V~^, we may write this in llie form x = A^w. 

To rewrite this in a form capable of generalization, we note that the adjoint 
integral operator is defined by the formula {A*xi,X2) = {xi,Ax2) in terms of the 



inner product {xi,X2) = I Xi{t)x2{t)dt. With j/j = Axj, we have {xi,Ax2) = 

Jo 

{'1/1:1/2) = "(1/171/2) = {—Axi,X2) by partial integration and our boimdary conditions, 
so that we conclude A* = —A. Up to a sign that may be absorbed into w, we can 
therefore write the condition x = A^w as a; = Sw, where 



is a product of p > 1 alternating factors A* and ^4. 

This simple and powerful general way of choosing the smoothing matrix S where, 
in general. A* is the transposed matrix (the conjugate transposed in the complex 
case and the adjoint operator in the infinite-dimensional situation), works much more 
generally and is sufficient for many practical problems. 
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Indeed, assume that (as in most applications) the operator A is smoothing, in the 
sense that if a; G is a discretized version of a k times differentiable function / then 
y — Ax e H™ is a discretized version of a fc + 1 times differentiable function /. The 
operator A* is generally smoothing, too. Thus, if ?/ G is a discretized version of 
a k times differentiable function g, then A*y G R'" is a discretized version of a fc + 1 
times differentiable function g. (Typically, A is related to some integral operator, and 
A* to its adjoint integral operator.) 

Thus we get smoother and smoother functions by repeating applications of A 
and A* to some vector w corresponding to a boimded continuous function only. This 
suggests that we require that a smooth ;r is represented as x = Sw with a vector w of 
reasonable norm. (Since we need x G R", the final matrix apphed must be A*, which 
is the case for the choice (1).) 

Thus, by requiring that x can be represented in the form x = Sw with a smoothing 
matrix S given by (1), qualitative knowledge on smoothness can be modeled. Note 
that (1) implies 

(2) SS* = {A* AY. 

While X = Sw with (1) is a frequently used assumption, a discussion of how to find a 
suitable value of p is virtually missing in the literature. We shall return to this point 
in Section 10. 

4. The error estimate. The key to the treatment of ill-posed linear systems, 
a term we shall use for all linear systems where the standard techniques are inade- 
quate, is a process called regularization that replaces A~^ by a family Ch (h > 0) of 
approximate inverses of A in such a way that, as h -¥ 0, the product ChA converges 
to / in an appropriately restricted sense. The parameter h is called the regulariza- 
tion parameter. (More generally, any parameter entering the definition of Ch may be 
referred to as a regularization parameter.) 

In order that the Ch may approximate ill-conditioned (or, in the infinite-dimen- 
sional case, even unbounded) inverses we allow that their norm grows towards infinity 
as /i — >• 0. It is usually possible to choose the Ch such that, for a suitable exponent p 
(often p = 1 or 2), the constants 

71 = sup/i||C/j|| 

h>0 

and 

'y2= sup h-P\\{I-ChA)S\\ 

h>0 

are finite and of reasonable size. Then 

(3) \\Ch\\<^, \\iI-CHA)S\\<j2h^. 

The feasibility of regularization techniques is a consequence of the following funda- 
mental deterministic error bounds. The result is valid for arbitrary norms, though we 
shall apply it later only for the Euclidean norm. (However, cf. (18) below, implicit 
scaling may be needed to adjust the data to the Euclidean norm.) 
Theorem 4.1. Suppose that 



(4) 



x = Sw, ll^x - i/ll < A||w;|| 
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for some A > 0. Then (3) implies 



(5) 



\\x- 



ChvW < 




Proof. We have 



\\x - ChvW 



= \\{I-CHA)x + Ch{Ax-v)\\ 
<||(/-C^A):r|| + ||C^|| \\Ax-y\\ 
<||(J-C,^)5«;|| + ||C,||A||«.|| 
<\\{I-CuA)S\\ |HI + ||C„||A|HI 



< l2hF\\w\\ + -^A||w| 
ft 



by (4) and (3). This imphes (5).D 

Note that the two assumptions (3) and (4) Iiave a very dififerent status. (4) is an 
assumption about the problem to be solved, while (3) is a requirement on Ch that we 
may satisfy by choosing the latter in a suitable way. If (4) is violated, nothing can 
be said about the problem, while a violation of (3) only implies that the particular 
algorithm for choosing Ch is unsuitable for solving the problem. 

(The proof shows that one may allow in (4) the more general relation x = Sw + u 
with arbitrary u satisfying (/ — ChA)u = 0, without affecting the conclusion. In some 
applications, this allows a more flexible treatment of boundary conditions.) 

Traditionally, one assumes a residual bound \\Ax — y\\ < e; our form of the bound 
in (4) is motivated by the fact that the optimal h depends on e and w only through 
the quotient A = e/||«;||. Indeed, the bound (5) is smallest when the derivative of the 
right hand side vanishes, giving 



with reasonable factors hidden in the Landau symbol O(-). Here and later, the Landau 
symbols refer to the behavior for tiny A > 0. However, all nonasymptotic results in 
this paper are valid for any value of A. This is important since regular ization is often 
used when the available data (or the assumed model) are of low accuracy only. 

For a well-posed data fitting problem, i.e., one with a well-conditioned normal 
equation matrix A^A, the least squares estimate has an error of the order of A. 
This follows from (5); indeed, Ch = A+ satisfies (5) for h'^ = \\A+\\ = 0(1) with 
-y^ = l^y2 = 0 independent of j3. (One can therefore allow p ^ oo without blowing 
up the bound, so that the right exponent 1 is approached in (7).) 

For an ill-posed problem, the reduced exponent ^ < 1 in (7) (that can be shown 
to be optimal under the assumptions made, cf. Proposition 3.15 of [9]) implies a loss of 
precision due to the ill-posed nature of the problem. Roughly speaking, it means that 
the number of correct digits that can be expected in an approximate solution is only 
a fraction of of those one would expect in a well-posed problem. In particular, as 
familiar from numerical differentiation, the accuracy of the approximation for p = 1 



(6) 
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is only roughly half the number of digits to which the data are accurate. As one 
can see from (5) and the proof, the error is bounded as a sum of two terms. The 
first term, divergent as /i — > 0, is proportional to the error level of the data vector 
y, and the second term, divergent as 7i — >■ oo, reflects the error in the approximative 
inverse. Thus, the situation is completely analogous to that encoimtered in numerical 
differentiation. There (a thorough discussion is in Chapter 8.6 of Gill et al. [10]) 
f'{x) is approximated using inaccurate function values f{x + Th) with absolute errors 
bounded by A. In approximation formulas like 



for forward differences or 

f{x + h)-f{x-h) 



fix) 



h 



for central differences, two terms with opposite behavior for /i 0 and h -¥ oo arise. 

As in (6)-(7), an optimal choice of h = 0(A^/'-^) in the first case and h = 0{A^^^) 
in the second case gives the approximation orders 0{A'-^'^) for the error of forward 
differences and 0(A^/^) for the error of central differences. 

In order to evaluate (6) for a specific problem one needs to know 7i,'y2,P and 
A. Usually, p is assumed to be fixed, typically at p = 1 or p = 2; then 71,72 can be 
determined from the formula used to define C^; cf. Section 5. On the other hand, in 
practice, one hardly knows precise values for A. If we choose instead of the optimal 
value (6) some value 

h = 7o5i/(''+i) 

with a guess 6 for A and some positive constant 70, we find 

\\x - ChvW < const. max(A,5)(J-i/(P+i). 

This immediately shows that, with the same relative error in S, a choice S > A has 
less influence on the size of the bound than a choice ^ < A, at least when p > 1. 
Thus if in doubt, S should be taken conservatively large. (In the context of solving 
nonlinear systems or optimization problems, the overestimation is usually made good 
through very few additional iterations, unless 6 is chosen excessively large, while a 
too small choice of 6 may be disastrous.) 

Note also that (6) implies that, generally, problems with larger p are solved with 
a larger value of h, at least as long as 71A/72P remains mucli smaller than 1. 

5. Choosing approximate inverses. For a well-conditioned approximation 
problem Ax « y, the residual ||j4x — y||2 becomes smallest for the choice 

(8) x=iA*A)-'A*y. 

When A is rank deficient or becomes increasingly ill-conditioned, this choice is impos- 
sible or increasingly useless. We may improve the condition of A* A by modifying it. 

The simplest way to achieve this is by adding a small multiple of the identity. Since 
A* A is symmetric and positive semidefinite, the matrix A*A + ti^I has its eigenvalues 
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in + \\A\f] and hence a condition number < {h^ + \\A\\'^)/h^ that becomes 

smaller as h increases. With this replacement, the formula (8) turns into 

(9) X = (A* A + h^I)-^A*y, 

a formula first derived by Tikhonov [31] in 1963. He derived it by solving the 
modified least square problem 

\\Ax-yf + h^\\xf = mini 

Formula (9) corresponds to the family of approximate inverses defined by 

(10) Ch = {A*A + h^I)-'^A*. 

More generally, we may consider using other matrix functions of A* A as approximate 

inverses. Indeed, arbitrary functions tp defined on 1R+ can be extended to functions 
of T = h~'^A*A. For example, if ip is continuous, we may approximate it by a 
sequence of polynomials ipi with (fi(t) = lim fiit) uniformly for \t\ < \\T\\, we have 

i— >-oo 

<fiT) = lim <pi{T). 

l—yoo 

Theorem 5.1. For any smoothing mutrix S satisfying (2), the bounds (3) hold 
in the 2-norm for the family of approximate inverses 

(11) Ch = h-'^>p{h-'^A*A)A* 
for any function f : 11+ IR such that the constants 

71 = sup \<pit)t^/^\, 72 = sup |1 - <p{t)t\tP/'^ 

t>0 t>0 

are finite. 

Proof. We use the fact that the 2-norni of a itiatrix B is given by the square root 
of the supremum of the eigenvalues of the matrix BB* . If we denote the eigenvalues 
of T = h~^A*A by ti then the tj, are nonnegative. Now the eigenvalues of 

(12) ChC; = h-*^{T)A*Aip{T) = h-\{TfT 
are h~'^Lp{ti)'^ti < h^'^^f, whence ||C';,,||2 < h^^^i. Similarly, since 

(13) R:= I - ChA = I- h-'^ip{T)A'A = 1- ip{T)T 
and SS* = {A*A)p = {h^Ty. the eigenvahies of 

(14) {RS){RS)* = RSS*R = (I - <f{T)T)'\h^T)P 

are (1 - 'p{ti)tif {hHif < h^P'yi, whence ||(J - ChA)S\\2 = WRSh < /i*'72.D 
In particular, Tikhonov regularization (9) is obtained by choosing 

(15) ^{t) = j^, 
and for this choice we find 

tV2 1 
71 =sup— - = -, 

t>0 t + 1 2 
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r 1/2 ifp=l, 

72 = sup — — = < 1 if P = 2, 
*>o* + ^ [ 00 ifp>2. 

Thus Tikhonov's solution formula (9) can exploit smoothness only up to a degree 
p < 2, since smoother solutions have to be treated by taking p = 2. In particular, 
Tikhonov regularization cannot approximate better than up to 0{A^^^), cf. (7). 

Iterated Tikhonov regularization. To obtain smoother solutions one may use 
the formula (9) with iterative refinement (on the unregularized system): 

and, for I = 1,2,. .., 

(16) x('> = x('-i) + {A* A + h^I)-^A*r^'-^\ =y- ^x^. 

This is commonly called iterated Tikhonov regularization, but it can also V>e considered 
as a preconditioned Landweber iteration. (The Landweber iteration itself is defined by 
dropping in (16) the factor (A* A + h'^I)~^ and can be treated in a similar way, but 
it is too slow to be useful in practice.) Actually, (16) predates Tikhonov's work and 
was found already by Riley [29] in 1956. 

Note that iterated Tikhonov regularization is not the same as iterative refinement 
for solving the regularized normal equations (9). (The latter is obtained by using in 
place of ^*r('-i) the vector A*y - [A* A + h^I)x'^'-^^ = ^V^'"') - h^x^'-^\ and has 
inferior approximation powers.) 

Proposition 5.2. We have 

(17) xW = ft-Vi {h-'^A*A) A*y, 
where 

^l{t) = 7 fl - ^ 



t\ {t + 1) 



Proof. This holds for i = 0 since ifio{t) = 0. Assuming (17) for Z — 1 in place of /, 
we have, with T = h~^A*A, 

= {I -T^i_i{T))A*y = MT)A*y, 

where 
Thus 

x» = xC-i) + {A* A + h^I)-^A*r('-^^ 

= h-''^,^,{T)A*y + {H'T + K^I)-^^i{T)A''y 
= h-\i{T)A*y 



A TUTORIAL ON REGULARJZATION 



11 



Since 



t \ {t + iy-^j {t + i 
1 ^ + 1 t 

= T 1-7^^ + 



t\ (t + iy {t + iy 



Thus (17) holds generally.Q 



Therefore, Theorem 5.1 apphes with ip{t) = ^pi{t) , and Theorem 4.1 gives an 
error bound for S; = a;*^'^ defined by (17). Now it is easy to see that this choice leads 
to 7i < 00 for alH > 0 and 

= sup TIXTu < 00 for p < 21. 
t>o (t + 1)' 

Thus we are able to exploit smoothness for all orders p <2l. 

Inspection of the proof of Theorem 5.1 shows that the supremum needs only 
be taken over alH € [0, /i~^||j4|P]. This implies that one may get reasonable values 
for 7i and 72 also by choosing for 9 suitable polynomials. This allows Ch'y to be 
computed iteratively using only matrix- vector multiplications with A and A*. For 
example, one can use it in the context of conjugate gradient methods. This makes 
regularization a viable technique for large-scale problems. For details, see Brakhage 
[5], Nemirovski [26], Hanke & Raus [16]. A different technique for the large- 
scale case, allowing more general linear and even nonlinear regularization conditions, 
is discussed in KAUFMAN & Neumaier [20, 21]. 

Scaling. In many applications, such as nonlinear systems and optimization, the 
linear systems that need to be solved may be badly scaled, and it is advisable to 
apply the preceding results to scaled variables x = Dx related to x by a diagonal 
transformation matrix D that ensures a common magnitude of all variables. (This 
is equivalent to using Theorem 4.1 for a scaled Euclidean norm; hence corresponding 
error estimates are valid.) 

If an appropriate scaling matrix D is known, the transformed system to be solved 

is 

(18) Ax = y, A = AD-^, 
and the iteration (16) becomes 

= ^('-1) + {A*A + h'^I)-^A*r^^-^\ rW =y- Js^. 
In terms of the original variables x^^^ = D~^x^'\ this can be rewritten as 

(19) = x^'-^^ + {A* A + h^D^)-^A*r^'-^\ =y- Ax^'K 

Note that (19) can be computed efficiently with a single matrix factorization. 

If no appropriate scaling matrix D is available, a natural choice is often (but not 
always) given by the diagonal matrix with diagonal entries 



(20) 



Dkk = ViA*A)kk = \\A..kh, 
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where A.j; denotes the fcth column of A. With this choice, the results are invariant 
under rescaling of the right hand side and the \-aiiables (and corresponding scaling 
of the rows and columns of A). In this scale, using A* A + K^D'^ in place of A* A 
amounts to multiplying the diagonal entries A*Ahy k = \ + h"^ before doing the 
factorization. 

Since the iteration (19) allows one to use in the bound (5) higher and higher values 
of p, one can neglect the second term in the parenthesis in (5) if h is not too close to 
1, and make the first term small by choosing h largest subject to this requirement. 
The value h = 0.1 corresponding to k = 1.01, is a good compromise. 

When used in a context where inaccuracies in the solution can be corrected at 
later stages, a simple stopping criterion for the iteration (19) is to accept a;*^'^ when 
either ||r'*^''|| is sufficiently small or when Ha;*'"*"^^!! exceeds a certain bound. Suitable 
thresholds are usually available from the context. 

6. Using the singular value decomposition. We now assume that we know 
a singular value decomposition (SVD) of A, 



For A s R™^", {7 is a square orthogonal m x m-matrix, V a square orthogonal nxn- 
matrix, and S = Diag(CTi ..... a,, ) a rectangular diagonal m x n-matrix with diagonal 
entries Sj, = at. For details see, e.g., Golub & van Loan [11]. 

It is well-known that the minimum norm solution of the least squares problem 
\\Ax — j/lll = min! is given by 



where the ith column Vu of V is the singular vector corresponding to the ith singular 
value ai, and 



Ill-conditioned matrices are characterized by the presence of tiny singular values a,, 
and it is clear from the representation (22) that errors in y that show up in the 
corresponding components {U*y)i are drastically magnified. Therefore, the minimum 
norm least squares solution is useless for problems with tiny but nonzero singular 
values. 



Using the SVD, one can give meaning to (11) even for irrational functions ip. 
Indeed, then A* A = VJ:*i:V* and {A* A)' = V{i:*T,yV* for / = 0, 1, 2, . . ., so that 



for all polynomials ip, and by a limiting process also for arbitrary continuous ip. 
Therefore, (11) becomes 



(21) 



A = Ui:V*, U, V orthogonal, E = Diag((Ti, . . . ,o-„). 



(22) 






r* 



(23) 
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This differs from (22) in the replacement of the ill-conditioned diagonal matrix E+ 
by a better behaved diagonal matrix E^. In particular, the choice 

^ ' \ 0 otherwise 
removes the small singular values from the sum (22) and leads to 

(24, ^>=Dia,«,.), ^i;^^*- 

The use of (23) and (24) is referred to as regularization by a truncated singular value 
decomposition (TSVD). It has 71 =72 = 1 independent of the degree p of smoothness. 
(However, p is still needed to find the optimal h in (6).) 

The alternative choice ^p{t) = 1/ max(l, t) also works for all p. It has 71 = 1, too, 

but a smaller constant 72 = _^^ ( ^ ^ Y^"^ < 1, suggesting a slight superiority over 

TSVD. 

In applications where scaling problems may arise, it is again advisable to apply 
these formulas to the scaled problem (18) with D defined by (20). 

7. The difficulty of estimating A. So far, we assumed that A or an approx- 
imation « A was known. After having chosen (p. one can compute 71 and 72 from 
Theorem 5.1, and then choose an optimal regularization parameter by (6). To com- 
plete the discussion it is therefore sufiicient to consider ways to estimate A from the 
information in A and y. 

At first sight this seems an intractable problem, since when A is unknown then 
the assumption (4) is virtually vacuous. Indeed, by a theorem of Bakushinskii [1] 
(reproved as Theorem 3.3 in Engl et al. [9]), any technique for choosing regularization 
parameters in the absence of information about the error level can be defeated by 
suitably constructed counterexamples, and indeed the techniques in use all fail on a 
small proportion of problems in simulations where the right hand side is perturbed 
by random noise. 

As a consequence, the literature about choosing regularization parameters (see 
Hanke & Hansen [15] for a recent review) is surrounded by a sense of mystery and 
typically based on heuristics and the study of model cases with special distributions 
of singular values and special right hand sides. 

The only arguments of a more rigorous nature, obtained in the context of smooth- 
ing splines and summarized in Chapter 4.5 of Wahba [34], are based on stochastic as- 
sumptions. But they are discussed using arguments inaccessible to people not trained 
in statistics, and this makes the approach difficult to understand. As we shall show 
now, a stochastic approach is feasible in the general situation, and with only elemen- 
tary machinery. 

In view of the above remarks, we can only hope to construct estimates of A that 

work most of the time. We formalize this in a stochastic setting. As a bjrproduct, we 
are also able to find useful estimates for the order p of differentiability. 

8. Stochaistic error analysis. We suppose that we are solving a particular 
problem 



(25) 



y = Ax + 1, X = Sw 
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from a class of problems chaxax;terized by stochastic assumptions about e and w. More 
specifically, we assume that the components and Wk are uncorrelated random vari- 
ables with mean zero and variance and t^, respectively. In terms of expectations, 
this assumption implies that 

(ciWfc) — 0 for all i, k, 
i^i^k) = {wiWk) = 0 for i ^ A;, 

and 

{el)=a\ {wl) = r' for all fe. 

Here {z) denotes the expectation of a random variable or random vector z. In vector 
form, these conditions become 

(26) {ee*)=(T2/, (ew*) = 0, {ww*)=t'^I. 

Since the expectation operator {•) is linear, we can compute from (25) expectations 
of arbitrary quadratic forms in e and w. In particular, 

{xx*) = {Sum*S*) = S{ww*)S* = t'^SS*, 

(27) {xy*) = {Swiw*S*A* + e*)) = t^SS^'A*, 

{yy*) = {{ASw + e){w*S*A* + e*)) = t^ASS*A* + a^I. 

Since <t measures the size of e = y — Ax and t measures the size of w, the quotient 

(28) A = (t/t 

is an appropriate measures foi' the relative noise level in the stochastic case; cf. Miller 
[24]. Because of the stochastic assumptions made, A can be estimated from an avail- 
able right hand side; see Section 10. Once A is known, the following result (due to 
Bertero et al. [3]) gives the best one can hope to got. {tiB denotes the trace of a 
square matrix B.) 

Theorem 8.1. The error term {\\x - Cy\\'^) /Is mnmual for C = C, where 

(29) C = {SS*A*A + A:^I)-^SS*A*. 

For this choice, the optimal estimator x = Cy is computable as x = Sw from the 
solution of the positive definite symmetric linear system 

(30) {S*A*AS + A^/)w = S*A*y, 
and we have 

(31) (\\x-xf)=THiiSS* -CASS*). 



Proof We look at the behavior of the error term when C is replaced by a perturbed 
matrix C + E. Since 

||x-(C + E)y|p =\\x-Cy-EyW' 

= \\x-Cyf-2ix-CyrEy + \\Eyr 
= \\x - Cyf - 2tvEyix - Cy)* + {{Eyf, 
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we have 

(||x - (C + E)yf) = {\\x - Cyf) - 2tvE{y{x - Cy)*) + {\\Eyf). 
The trace term vanishes for the choice 

(32) C = {xy*){yy*)-^ = S S* A* {t^ AS S* A* + ct'^I)-\ 

since 

{y{x-Cyr)={yx*)-{yy*)C*=0. 

Hence 

{\\x -{C + E)yr) = {\\x - Cyf) + {\\Eyf) > {\\x - Cyf). 

Putting E = C - C, -we see that (||a; - Cyf) is minimal for C = C. Now K := 
t^{t^ASS*A* + (T^/)-! is the solution of 

{tKASS*A* +a^I)K = r^I. 

Multiplying by t-'^SS*A*, we find {SS-A-ASS-'A* + A'^SS*A*)K = SS*A*, hence 

SS*A*K = {SS^A^A + A2/)-i55*A*. 

By inserting this into (32), we get the formula (29). Moreover, 

iWx-Cyf) ={tT{x -Cy){x-Cy)*) 

= tr{xx*) - 2trC(yx*)+trC{yy*)C* 
= ti{xx*) -tiC{xy*)* 
= T^tT{SS* - CASS*), 

giving (31).n 

Of course, in practice, we may need to scale the system first, cf. (18)-(20). 

In the special case S = I we recover Tikhonov regularization. Thus Tikhonov 
regularization is the stochastically optimal regularization method under the weak 
qualitative assumption that x = w is reasonably bounded (corresponding to the case 
p = 0 of Section 2). However, as we shall see below, estimators of good accuracy are 
possible only when assuming p > 0, and then Tikhonov regularization is no longer 
optimal. 

In the applications, a will be small while t will be large enough to guarantee that 
A <C ||^>S'||2. This inequality guarantees that the regularization term in the above 
optimality result does not dominate the information in the system coefficient matrix. 

Note that A corresponds only roughly (within a factor 0(1)) to the A used in 
the deterministic discussion. In particular, while estimates of A (such as those found 
in Section 10 below) may be used to calculate x by Theorem 8.1, it appears dubious 
to use it in Theorem 4.1 with the optimal h computed from (6). 

For any fixed A, we may solve (30) by a direct or iterative method. However, 
when a singular value decomposition is computationally feasible, one can find explicit 
formulas that give cheaply the solution for many A simultaneously, an important 
consideration when A is unknown and must be found iteratively. 
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9. Using the singular value decomposition. For the standaxd choice (1) of 

S, where SS* = (^4*^)*', we note that the matrix C defined by (29) has the form (11) 
with 93 (i) = tP/{tP^^ + !)• Therefore we can use the singular value decomposition as 
before to simplify the formulas. 

Theorem 9.1. If SS* = (A* Ay and A = UHV* is a singular value decomposi- 
tion of A then the optimal linear estimator x = Cy can he computed from c = U*y as 
X = Vz, where, with A = cr/r, 



2p+l 
^ ^k 



and we have 



2p 
<^k 



(33) (ll^-*lli>=^'E-2p+2^.,> 

k ^k +^ 

(34) (cc*) = Diag((7f +^ + A^). 

Proof. If A = UT.V is a singular value decomposition of A, we can write the 
covariance matrix (yy*) calculated in (27) as 

W := (yy*) = T^AiA*A)PA* + cr^I = t^{AA*)p+^ + cr^I 

= u{t^{i:i:*)^^ +a^i)u*. 

The vector 

(35) c := U*y 
has a diagonal covariance matrix, 

(36) (cc*) = U*{yy*}U = t^(SE*)''+i + a^I. 

Now (32) takes the form 

C = T'^{A*AyA*W-^ = T^A*iAA*)PW-'^ 

= rH"E*(EE*)P(T^(EE*)P+i +a'^I)-W* = VDU* 

with the diagonal matrix 

(37) D = E*(EE*)P((EE*)P+i + A^I)'^. 
Therefore 

(38) X = Vz, 
where z = DU*y = Dc has the components 
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It remains to express the residual term (31). Since 

ss* = {A*A)p = v{i:*i:)pv*, 

CA = VDU*UT,V* = VD'EV*, 

we get the desired result from 

iW^-CyWi) =THriI-CA)SS* 

= tr V{I - DS)(EE*)f V" 
= tr(/ - DT,){^'E*)PV*V 
= r2tr(/-£IS)(SS*)P 

k 



□ 



The above formulas are related to those known for a long time in the classical 

technique of Wiener filtering (Wiener [36]) for the deconvolution of sequences or 
images, probably the earliest application of regularization techniques. The singular 
values cTfc correspond to the absolute values \Hk\ of the Fourier coefficients of the point 
spread function H defining a convolution operator A. The deconvolution of a sequence 
or image y is done by multiplying its Fourier transform y by the Fourier transform 
of the filter function, H*/{\H\'^ + Pn/Pg), where P„ is the noise power and Pg the 
power in a model sequence or image, and operations are elementwise. If one assumes 
that Pg — \H\^'P (which is an implicit smoothness assuinptu)n for the iniensilies of 
the true image or density) and writes c = arg(ff*)j/ and Pn — A^, the product takes 
the form (39). (All this finds its natural explanation by noting that the singular value 
decomposition of a circulant matrix can be written exphcitly in terms of the Fourier 
transforms.) For details see, e.g., Katsaggelos [19]. 

Order of convergence. The following result about the optimal convergence 
order reveals a difference between the stochastic and the deterministic situation. We 
assume that r = 0(1), and that a = 0{A) is small. 

Theorem 9.2. For any q € [0, 1], we have 

(40) - ^11^) < a^r^-^V^*' = 0(A2«) 
with 

(41) «, = gni-g)i-«^<Tf-^«(^^\ 
Proof. We rewrite (33) as 

/||^ _ ^||2^ _ ^2 V- (^rV-A)^'^ A2g-2 2p-2g(p+l) 



<.^(s. 



,2p-2«(p+l) 
k 



18 



A. NEUMAIER 



The bound (40) follows since cr^ A29-2 = o-2(cr/r)29-2 = ^-2-290-29 and 

x>o a;2 + 1 



(attained at the zero x = yjqli^ — q) of the gradient). □ 

Since in situations that need to be regularized we always have some tiny (7^, the 
constant a, is reasonable only when q < In particular, since we need > 0 in 
order that the bound (40) goes to zero as the model error a goes to zero, p must be 
strictly positive. 

The choice q = yields an a, proportional to the rank of A, the number of 

nonzero singular values. Thus we recover the deterministic convergence order only 
when the rank of A is not too large. For typical discretized function space problems, 
however, A has full rank, the dimension is large, and, usually, the singular values cr^ 
of A approach the singular values cr^ of the continuous problem, 

(7ft — >• (T^ as n — >• 00; 

cf. Hansen [18]. To have a bounded a, in this limit we can only use a reduced 

exponent 

» — e 

where e is a number such that 



and then have 



x-Cy\\l) = 0{K'% 



independent of the dimension. 

10. Estimating the regularization parameter. We now return to the prob- 
lem of estimating A. We base our discussion on tlie SA^D: lio\^- to adapt the estimation 
in the case when a SVD is not available is postponed to Section 11 (after (81)). The 
key is relation (34) that expresses the covariance matrix of the (high-dimensional) 
vector c in terms of the unknown parameters r and A = <t/t. Prom this relation, we 
get 

(42) {cD = a' + t2(t^p+^ for A; = 1, . . . , n. 
Now in general, relations of the form 

(43) {cl) = vk{e*) forfc = l,...,n 

allow a low-dimensional parameter vector 0* to be estimated from a single realization 
of the high-dimensional random vector c. Indeed, the estimation of variance compo- 
nents of random vectors is a classical problem in statistics (see, e.g., Barndorff- 
NiELSEN & Cox [2]). We shall give a direct and elementary derivation of a family of 

estimators. 
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Theorem 10.1. Suppose that c is a random vector satisfying (43). Let tj) he a 
strictly concave function, and define 

n 

(44) /(c,0) = {i^(:c,{e)) + i^'{vk{e)){4-vm)) 

k=l 

with weights a,k > 0. Then {f(c,6)) is bounded below, and any global minimizer 9 of 
{f{c, 9)) satisfies Vk{0) = Vk{0*) for k = 1, . . . ,n. 
Proof Write Vk = Vk{0),vl = Vk{9*). Then 

n 

{f{c,9)) - ^a,(^(i;fc) +^'(r;fc)(«^ -■j;fc)), 



{f{c,9*)) = Y,Mi>K) + o), 



k=l 

hence 



{Kc,6)) - {f{c,9*)) = Y^ak{^{vk)-^{vl)-^'{vk){vk-vl)). 

k=l 

Since i/j is strictly concave and at > 0, the fcth term in the sum is nonnegative, and 
vanishes only for Vk = v^. Hence {f{c,9)} > {f{c,9*)), with equality iff Vk = for 
all k.U 

Corollary 10.2. //, in addition, 9* is determined uniquely by the values of 
Vk{9*) then 9* is the unique global minimizer of {f{c,9)). 

The case of two parameters. For 

(45) (ci) =a^Hk+T^Xk for fc = l,...,n, 

the case of interest in our application, the estimation problem reduces to a global 
optimization problem in two variables cr and t. (We introduce fik and Afc since we 
shall Tiood it in Section 11.) It is possible to reduce this further to an optimization 

problem in the single variable 

(46) r = A2 = ffV-r' 

by restricting attention to the family of concave functions defined by 



for some parameter q < 1. In the following, exp denotes the exponential fimction. 

Theorem 10.3. Suppose that c is a random vector satisfying (45). Define, for 
given weights > 0, 



(48) 



7,(i) = ^afc(Afc + <Mftr"'4> 
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(49) PS) = J2'^kiXk+tfXk)\ 

f(f)^f logT9(^) + ^log/?,W «/0 / g < 1, 

^ ' -^'^ ' I log7o(t)+E«fclog(Afc+*Mfc)/E«fc ifq = G- 

Then (exp/5(t)) is bounded below, and t* — a^/r^ is the unique global minimizer of 
{expfq{t)). Moreover, 

(51) a' = t*{j,it*))/l3,it*). 

Proof. Condition (45) is equivalent to (43) with 

(52) VkiO) = 9itik + 92\k, 0* = (^^^ . 

We apply Theorem 10.1 with tjj = xpg given by (47), and consider the optimal point 
0 = 9*, and the associated t = t* = jr^. Since ^ is a minimizer of (/(c,^)}, the 
gradient 

V{/(c,^)) = ^aft(tA'(«fc) + riPk){{cl) - Vk) + 'ii>'{vk){-\)Y}vk 

vanishes. Since (<T^,T^)V?;ft = (cr^,r^)(/ife, A^)^ = Vk, we find 

^akij}" {vk){{cl) - Vk)vk = 0, 
and since V'g(^) = v^~^i'^q{v) = {q — l)v^~^ for the choice (47), we find 

(53) ^a,t;r'({c|)-i;fe)=0. 

Using (52), (48) and (49), this equation becomes T^^~'^{')^g(t)) - T^'^0q{t) = 0, giving 

Thus and cr^ can be expressed in terms of the single parameter t. Now (44) and 
(53) imply 

(55) (f{c,9)) = '£c^kMvk). 

For the limiting case g = 0, we find 

{f{c,9)) =J2(^klogVk = ^afc(logT^ +log(Afc + fpfc)) 
= /?o(log(7o(t)) - log,3o) + Y^ak log(Afe + tfik) 

= ,5olog{exp/o(^))-,3olog,3o. 

where /?o = is independent of t. Thus minimizing {f{c,0)) is equivalent to 

minimizing (exp/o(i)). And ior: q^Q we have 

9(/(c, 9)) + Y,(^k =Y. MQM'^k) + 1) = 1] akvl 
= T"''£ak{Xk+ttikr=7^'P,it) 



1-9 
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and this expression must therefore be minimized (for ^ > 0) or maximized (for q < 
0) to get the optimal t. By taking the logarithm and dividing by q, we reverse 
the monotony behavior for g < 0; thus minimizing {f(c,$)) is again equivalent to 
minimizing {expfq{t)). Thus the theorem follows from the previous result.D 



Choice of weights. We now consider the choice of weights in (48)-(49). Since 
the weights are not random, the must be independent of the Cfe. It is desirable 
that the objective function does not change when condition (45) is rescaled by positive 
scale factors TTfc. Therefore the weights must be chosen in such a way that the trans- 
formation Afc = TTfcAfc, /tfc = TTfcjttfe, Ck = T^lf^Ck transforms the into a/t = Tr^'afc. 
This is the case for the choices 

Oik = ^kfJ^k^ q = -r-s; 

to ensure that tiny values of and //^ do not cause trouble we need r, s > 0. Since 
q is determined by r and s, it is convenient to use the index pair rs in place of q. 
For r = s = 0, we find the generalized maximum likelihood (GML) merit function 

(56) /oo(<) = log ^ + log(^fe 

(Indeed, under the additional assumption that the are independent Gaussian ran- 
dom variables with zero mean, the global minimizer t of foo{t) is the traditional 
maximum likelihood estimator for t*.) 

If /• > 0 or s > 0, we find the {r, s)- generalized cross validation (GOV) merit 
function 

(57) fr.it) = log7„(t) - log^„(t), 

r + s 



where 
(58) 



(59) >«(*) = T ( ( T-^Y ( ^^-^ ■ 

^ ' ' ^' '^\\k + tiJ.kJ \Xk + tiikJ \Xk + tiikJ 

(Indeed, in an important special case discussed in Section 11 below, the global mini- 
mizer of the (0,1)-GCV merit function is the familiar GOV estimate for the regular- 
ization parameter.) 

Practical use. For the regularization problem, the above results apply with 

(60) Afe = af+2, /xfc = l 

(compare (42) and (45)). 

In practice, the expectation (exp/rs(i)} cannot be computed since only a single 

realization is known. However, the theorem suggests that one can estimate t by finding 
the global minimizer of exp frs{t) or equivalently of frs{t), where Afc and //fc are given 
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Table 1 
Failure rates in percent 
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in terms of p by (60). If p is unlaiovvn, one can also minimize over a set of values 
p= 1, 2, 3... to find tiie best order of differentiability. 

Assuming that the Cfe are independent Gaussian random variables with zero mean, 
and with some technical additional assumptions, it can be proved that the resulting 
estimate i for f (and p for p) is an asymptotically unbiased estimate as n — >• oo; see, 
e.g., Welsh [35]. In particular, this holds for the maximum likelihood estimator (56), 
whose asymptotic properties are well-known (see, e.g., Barndorff-Nielsen & Cox 
[2]). 

To find the best values for r and s, we note first that the results only depend on 
the distribution of the quotients 



We therefore performed a simulation study of the estimation properties for various 
choices of these quotients. We looked at two different kinds of distributions repre- 
sentative of realistic problems, namely algebraic decay, qu = qihr"\ and exponential 
decay, qu = gie^''*, where gi and a are chosen such that the qk range between 10^*' 
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Table 2 

Number of first, second and third places 
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and 10''- , with constants ei and 6-2 that were varied independently in {1, 2, 4, 8, 16, 32} 
(if ei and 62 had different signs, estimation was extremely unreliable). Then we chose 
(T^ = 10~'^>, T = 1, Afe = cr^/qk and /Lt^ = 1, and minimized for each case the curves 
for frg{t) with r, s £ {0, |, 1, l|, 2}, using samples of various sizes n of independently 
distributed Gaussian noise with mean zero and variance given by (45). 

For the simulation, the minimization was done by evaluation at t = lO^t* , where 
t* — a'-^/r'^ and g £ {—2.0, —1.9, 1.9, 2.0}. If the smallest function value occured 
for 1^1 > 1 for all values of (r, s) under study, the estimation was classified as failure. 

Failure rates decrease with increasing dimension n; some are reported in Table 1. As 
to be expected, when the dimension n is small, there is a larger range of distributions 
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of the Qk where the estimates are unreUable for all choices of (r, s). 

If the minimum occured at some \g\ < 1 for at least one value of (r, s), we ranked 
the pairs (r, s) by the value of 1^1, giving gold medals for the smallest 1^1, and silver 
and bronce medals to the next smallest. (But if there were three gold medals, no 
other medals were given, and if there were two gold medals or two silver medals, no 
bronce medals were given.) The ranking by number of medals is given for n = 500 and 
3600 test cases in Table 2; the fact that there are much fewer silver medals and very 
few bronce medals shows that often several choices of (r, s) share the best position. 
The GML merit function (with r = s = 0) is the clear winner. For smaller n or for 
smaller values of the threshold defining failures, the behavior is similar, though there 
are fewer gold medals, and GML remains uniformly at the top of the list, both in 
terms of gold medals and in terms of total number of medals. 

Thus one may recommend to estimate t = A'^ and p by minimizing the GML 
merit function (56), and then to use Theorem 8.1 to get x. A suitable starting point 
for the minimization is given by 

i = median(Afc//ifc), p = l. 

(With this starting point, the local univariate minimizer I used, based on a bracket 
search and local parabolic interpolation, cf. Gill et al. [10], usually takes about 7-15 
function values, most typically 9, to find a local minimizer t*.) 

To assess more completely the impact of various methods to estimate t and p one 
also needs to study how sensitive the accuracy of a computed solution x depends on 
the choice of t and p. The only investigation in this direction known to me is a paper 
by Wahba [34] that shows that there are circumstances under which (0,1)-GCV is 
more robust than GML when p is misspecified. It is unknown whether this persists 
when p is estimated together with t from (0,1)-GCV or GML. 

11. More flexible smoothness constraints. We now look at the regulariza- 
tion of general linear stochastic models 

(61) y = Ax + e, (ee*) = a^V, 

where A e JR"^^"', a'^V € R™^'" is a symmetric and positive definite covariance 
matrix, and a is typically unknown. 

The qualitative constraint is assumed to be given in the form that certain com- 
ponents or linear combinations of the state vector cannot become large, and we can 
formulate this with a suitable matrix J e R'"'^" by assuming 

(62) Jx = w 

with w of reasonable norm. In the applications, J is usually a matrix of suitably 
weighted first or second order differences that apply to some part of x containing 
function values or densities. For problems involving images, x often contains grey 
values or color intensities, interpreted as a two- or (for movies or multiple slices 
through an object) three-dimensional function of the position. Then (62) is again a 
smoothness condition, and by judicious choice of J, the smoothness requirements can 
be adapted to particular problems. 

Modeling smoothness by (62) has the advantage that the smoothness condition 
may be unrelated to A. In particular, one can use it also for problems where A is 
well-conditioned but the solution of Ax = y would lead to an unacceptably rough 
solution X. An important case where A is the identity matrix is that of curve or 
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image smoothing. Here y contains the function values of a noisy curve or the pixel 
intensities of a noisy image, and one wants to reconstruct the smooth original x. 
(For the relaxed requirement that x is only piecewise smooth, one needs nonlinear 
regularization terms, that must be handled iteratively. See, e.g., [21, 32].) 
Our assumptions lead to the conditions 

(63) \\J^\\ of reasonable size, 

(64) {{Ax - y)*V-\Ax - y)) = a'^m. 

Traditionally, one interprets (63) by adding a multiple of (|| Ja;|p) as penalty to (64), 
and solving the associated least squares system 

(65) a-2(y - Ax)*V-^{y - Ax) + t-'-'H J-cir-" = min! 

With t = := a^/r^ as regularization parameter, the solution can be found from 
the normal equations 

(66) Btx = A*V-'^y, where Bt = A^V'^ A + 1 J* J, 

These equations are again a generalization of Tikhonov regularization, which is ob- 
tained as the special case V = I, J = I. 

If J is square and nonsingular, we may rewrite (62) as a; = Sw with J = S~^. 
If = / (which can always be enforced by a transformation of e), we may use the 
stochastic setting and obtain from Theorem 8.1 the optimal estimator x = Cy = 
(55*vl*A + A2/)-i5S*^*S/ = {A*A + /\^rj)A*y, and this formula agrees with (66). 
Therefore, (65) is the natural generalization of the optimal situation considered before 
to the present context. 

For large scale problems, (66) can be solved using one Cholesky factorization 
for each value of i, and we show below how these factorizations can be used to find 
an appropriate regularization parameter, too. Elden [7] observed that a significant 
amount of work can be saved by first reducing the problem to bidiagonal form. For full 
matrices, this can be done with O(n^) work, and each subsequent fvmction evaluation 
costs only 0{n) operations. Unfortunately, the method is limited to dense problems 
(banded problems work too, but only for J = I) since sparsity in A and /or J is 
destroyed by bidiagonalization. For problems with banded A and J one can still save 
a little work by first reducing A and J to banded triangular form; see Section 6.3.4 
of Bjorck [4]. 

If the sparsity structure of B is such that the computation of several Cholesky 
factorizations is not practical, iterative methods have to be used. For details see the 
references given in Section 5. 

Using the generalized SVD. We now show that we can choose a special coor- 
dinate system where both (63) and (64) become separable and hence easy to analyze. 

Theorem 11.1. (i) There are nonsingular matrices M € H"^" and N € H™^™, 
and nonnegative diagonal matrices D € K^^^ and E e R"^" such that 

(67) N*N = V-'^, NA = DM, J*J = M*E*EM. 
(it) Whenever (67) holds, the transformed variables 



(68) 



z:=Mx, c:=Ny 
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satisfy 
(69) 



llJxIl^ = \\Ez\\\ 



(70) 



{Ax - y)*V-'^{Ax -y) = \\Dz - c\\ 



Proof. We give a fully constructive proof that directly translates into an algorithm 
for computing M, N, D and E. We begin by factoring V = LL* and consider, for some 
/? 7^ 0, the C?i?-factorization 

(71) {^p/) = Q = (q^) ^ IR*™^'"^''" , Q*Q = I 
and upper triangular i? € R"*^". Thus 

(72) L-^A = QiR, pJ = Q2R' QlQi + Q-2Q2 = I- 

Using the singular value decomposition 

(73) Qi = UDW* with orthogonal U € R™''™, W € H"''" 
and a nonnegative diagonal matrix D e H™'^" we define 

(74) M := W*R , N := U*L-^ . 

From (73) we find D*D = (U*QiW)*(U*QiW) = W*QlQiW, so that the diagonal 
matrix 

(75) / - D*D = W*W - W*QlQiW = W*Q\Q2W = {QiWyiQiW) 

is positive semidefinite. Therefore its entries are nonnegative, and we can form the 
nonnegative diagonal matrix 

(76) E:=p-\I-D*D)^/^, 
with component-wise square roots. Now 

N*N=L-'^UU*L-^ = L-'^L-^ = {LL*)''^ = T/-\ 
NA = U*L-^A = U*QiR = U*UDW*R = DW*R = DM, 
pj = Q.^R = Q2WW*R = Q-2WM, 

p^J*J= {Q2WM)*{Q2WM) = M*(Q2Wy{Q2W)M = M*(/ - D*D)A4 
= p^M*E*EM, 

since is a square diagonal matrix. This proves (67). We now conclude from (67) 
and (68) that 

\\Jxf = x*J*Jx = x*{EM)*{EM)x = IIEMxf = WEzW^, 



(Ax-y)*V-'^iAx-y) = (Ax -y)*N*N{Ax -y) 

= \\N{Ax-y)f = \\Dz-cf. 
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□ 



When V = I, the factorization (67) is generally referred to as a generalized sin- 
gular value decomposition. See van Loan [22, 23], Stewart [30] and Paige [27] for 
further properties of the generalized SVD. The case V I seems not to have been 

discussed before. 

An implementation of the transformation may proceed according to (71), (73) 
and (76). In (71) one should choose 

(77) p=\\L-'A\U\\J\\^ 

or a similar expression in order to ensure that the two matrices on the left hand side 
have similar magnitude. This guarantees a stable computation when V and hence L 
are well-conditioned. (See van Loan [23] for the stability of algorithms for computing 
the GSVD. The counterexamples to the stability of the GSVD computed using (73) 
given there do not apply if we use the matrices M and N only implicitly, as indicated 
below.) 

Stijf models (61), where V is ill-conditioned, arise in some path tracking applica- 
tions where there is noise on two different time scales (small system noise and larger 
measurement noise). In this case, the numerical solution according to (71), (73) and 
(76) may suffer from instability, and a stable formulation is unknown. 

Since by (75), the diagonal entries of D lie in the interval [0, 1], we can replace in 
(73) any computed singular values > 1 (produced by finite precision calculations) by 
1. Then the calculation of E in (76) causes no problem. 

Having determined the variances, we can find z by solving the least squares prob- 
lem 



(78) (7-2 ||c - Dzf + T-^\\Ezf = min! 

corresponding to (65). After multiplication by cr^ we find the solution z from the 
normal equations 

{D*D + tE*E)z = D*c, 
where t = cr^/r^. Since D and E are diagonal, we obtain 

-ttt; =:>- tor k <n, 

0 for A; > n. 

Note that c can be computed from y by 

(79) c = U*{L-^y); 

and since = R~^W~'^ = R~^W, the solution estimate x can be recovered from 
z by 

(80) x = R-^{Wz). 

Therefore, the matrices M and A'' in (74) need not be formed explicitly. The vectors 
(79) and (80) can be efficiently computed by triangular solves and matrix-vector 

multiplications. 
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Finding the regularization parameter. In the new coordinates, the model 
equation (61) implies that 

e:=c-Dz = Ny- DMx = N{y - Ax) = Ne 

satisfies 

(ee*) = N{ee*)N* = a'^NVN* = cr^i 

and in particular, 

(el) = for fc = 1, . . . ,m. 

On the other hand, the qualitative condition (63) says that \\Ez\\ is of reasonable size. 
In analogy with Section 8, we now consider the class of problems characterized by 
random vectors z such that w = Ez satisfies 

with some r = 0(1). 

If we also make the natural assumption that the Wk and the Ik are uncorrelated, 
(wkek) = 0, the variances cr^ and can again be determined using Theorem 10.3. 

Indeed, we have 

Cfc = EkkCk = DkkEkkZk + Ekk^k = DkkWk + Ekk^k, 

so that 

(c|) = DlkT^ + El^a^ for A; = 1, . . . , n = min(m, n). 
Hence Theorem 10.3 applies with 

(81) Ck=EkkCk, \k=Dlk, h=Elk 

in place of Ck, Xk and /xfc, respectively. 

For large scale problems, the generalized SVD is prohibitively expensive to com- 
pute or store. Therefore it is important that the above formulas can bo rewritten in 
terms of the original matrices. This allows them to be used for problems with large 
and sparse A and J, provided that V (and hence N) are diagonal or at least block 
diagonal. This is the case in a large number of applications. 

To rewrite the merit functions frs{t) for finding the regularization parameter t, 
we note first that (67) implies 

Bt = A*V-'^A + tJ*J = M*{D*D + tE*E)M; 

therefore 

Pt ■- NAB^r^A*N* = D{D*D + tE*E)-^D* 
is a diagonal matrix. Using (81) we get 
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and 

Using (58) and (59) with the barred quantities, this yields 

/?„(*) = ^-nrP;(/-p<)^ 

7rsit) = t-'-\NyrP;iI - Ptr+\Ny). 
Therefore, if r + s > 0, we get 

frsit) = hgiNyrPlil - Pty+\Ny) - logtr P;(J - py - ;:^logt. 

In the special case r = 0 and s = 1, the merit function becomes 

foiit) = log ||(J - Pt)Nyf - 21ogtr(/ - P*) = log(GCF/n), 

where 

i||(/ -A)iVyf 



GCV = 



(itr(/-Pt))2 



is the well-known generalized cross validation (GCV) formula of Craven &; Wahba 
[6]. The need for Pt, generally a full matrix, causes computational difficulties for 
sparse matrices. However, the product PtNy and the trace tr Pt can be computed 
at about three times the cost of a factorization of Bt, without forming Pj explicitly, 
thus making the formula useful as long as Bt has a sparse factorization; details can 
be found, e.g., in Sections 6.7.4 and 5.3.5 of Bjorck [4]. The case where r = 1 and 
,s = 0 stKuns to be new but can be handled in the same way. It is unknown whether 
fast methods exist that permit in the sparse case the efficient evaluation of the merit 
functions with r + s > 1 or nonintegral r or s. 

On the other hand, the limiting case r = s = 0 is even simpler. To write the 
formula in terms of the untransformed matrices, we assume for simplicity that n = n, 
i.e., m > n. Then 

log det Bt = 2 log det M + log det{D*D + tE*E) 

(82) = 21ogdetM + logn(-DL+*-E^fe) 

= 2 log det M + ^ log( Afc + tjUfc ) , 

so that, up to an irrelevant constant, 

(83) ho{t) = \og{NyY{I - P){Ny) - \ogt+ ^ logdetP*. 

If m = n and E is nonsingular, the expression (82) can also be written as 

log det Bt — const +n log t — log det(/ — Pt), 
showing that, again up to an irrelevant constant, foo{t) = logGML, where 

Vdet(/ - Pt) 
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is the generalized maximum likelihood formula of Wahba [33]. This also holds in the 
general case, but the expression given for GML must then be modified since one must 
take account of zero eigenvalues of / — P^. No such modification is needed for (83). 

Function values for the GML merit function (83) can be computed efficiently from 
a sparse Cholesky factorization 

A*V-^A + tJ*J = LtL* 

as 

(84) food) = \og{y*V-'y - \\utf) - log* + | logdetL*, 
where Ut is the solution of the triangular linear system 

(85) Ltut = A*V-^y. 

Each function evaluation requires a new factorization; therefore an efficient univariate 
minimizer should be used. A suitable starting value for the minimization is t = 
0.01tryl*y~^yi/tr Once a good regularization parameter t is determined, the 
solution X of the least squares problem (66) is found by completing (85) with a back 
substitution, solving the triangular linear system 

(86) L^x = Ut. 

The evaluation of the GCV formula is more exponsive; it requires, in addition 
to the factorization needed to compute PtNy, significant additional work for the 
computation of tr P*. It is gratifying that GML, the computationally simplest formula 
was found in the previous section to be also the most efficient one. 

More general regularization problems. In a number of applications, there 
are several qualitative constraints of different origin. If one formulates each such 
constraint as the condition that some linear expression J^x is assumed to be well- 
scaled and not too large, one may again take account of these constraints as penalty 
terms in the least squares problem. In analogy to (66), we get the solution as 

(87) Btx = A*V-^y, 
but now 

Bt = A*V-^A + '^tlJ:J, 

y 

contains several regularization parameters fj,, one for each qualitative constraint. In 
some cases, one may assume that the tp are proportional to some known constants, 
thereby reducing the problem to one \'.'itli a single regularization parameter (the pro- 
portionality factor). However, more frequently, no information is available about the 
relative size of the f„, and all regularization parameters must be determined simul- 
taneously. The GML criterion generalizes in a natural way; (84)-(86) remain valid, 
but now Lt is a Cholesky factor of Pj, and the vector t of regularization parameters 
may be found by a multivariate minimization of /oo(0- The more expensive GCV 
merit function extends in a similar way to this situation (Wahba [34]). See also Gu 
& Wahba [13]. 
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